Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity
While deep learning (DL) models are state-of-the-art in text and image
domains, they have not yet consistently outperformed Gradient Boosted Decision
Trees (GBDTs) on tabular Learning-To-Rank (LTR) problems. Most of the recent
performance gains attained by DL models in text and image tasks have used
unsupervised pretraining, which exploits orders of magnitude more unlabeled
data than labeled data. To the best of our knowledge, unsupervised pretraining
has not been applied to the LTR problem, which often produces vast amounts of
unlabeled data.
In this work, we study whether unsupervised pretraining of deep models can
improve LTR performance over GBDTs and other non-pretrained models. By
incorporating simple design choices, including SimCLR-Rank (an LTR-specific
pretraining loss), we produce pretrained deep learning models that consistently
(across datasets) outperform GBDTs (and other non-pretrained rankers) in the
case where there is more unlabeled data than labeled data. This performance
improvement occurs not only on average but also on outlier queries. We base our
empirical conclusions on experiments with (1) public benchmark tabular LTR
datasets and (2) a large industry-scale proprietary ranking dataset. Code is
provided at https://anonymous.4open.science/r/ltr-pretrain-0DAD/README.md.

Comment: ICML-MFPL 2023 Workshop Ora
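The abstract names SimCLR-Rank as an LTR-specific variant of SimCLR's contrastive pretraining objective. The paper's exact loss is not given here; as a hedged sketch, the standard SimCLR NT-Xent loss that such a method builds on can be written as follows (the function name and temperature default are illustrative, not from the paper):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent contrastive loss over two augmented views.

    z1, z2: (n, d) arrays of embeddings; row i of z1 and row i of z2
    are two views of the same item (the positive pair).
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize -> cosine sim
    sim = z @ z.T / temperature                       # (2n, 2n) similarity logits
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    # The positive for row i is its other view: index (i + n) mod 2n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

An LTR-specific variant would presumably change how positives and negatives are drawn (e.g. using the query grouping of tabular ranking data), which the sketch above does not attempt to reproduce.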